BOOSTING FOR HIGH-DIMENSIONAL LINEAR MODELS

By Peter Bühlmann
ETH Zurich 

We prove that boosting with the squared error loss, L2Boosting,
is consistent for very high-dimensional linear models, where the number
of predictor variables is allowed to grow essentially as fast as
O(exp(sample size)), assuming that the true underlying regression
function is sparse in terms of the $\ell_1$-norm of the regression
coefficients. In the language of signal processing, this means consistency
for de-noising using a strongly overcomplete dictionary if the underlying
signal is sparse in terms of the $\ell_1$-norm. We also propose here an
AIC-based method for tuning, namely for choosing the number of boosting
iterations. This makes L2Boosting computationally attractive since it
is not required to run the algorithm multiple times for cross-validation
as commonly used so far. We demonstrate L2Boosting for simulated
data, in particular where the predictor dimension is large in comparison
to sample size, and for a difficult tumor-classification problem
with gene expression microarray data.

1. Introduction. Freund and Schapire's [11] AdaBoost algorithm for
classification has attracted much attention in the machine learning community
(cf. [20] and the references therein) as well as in related areas in statistics 
[1, 13], mainly because of its good empirical performance with a variety of 
datasets. Boosting methods were originally introduced as multiple predic- 
tion schemes, averaging estimated predictions from reweighted data. Later, 
Breiman [1, 2] noted that the AdaBoost algorithm can be viewed as a gradi- 
ent descent optimization technique in function space. This important insight 
opened a new perspective, namely to use boosting methods in contexts other 
than classification. For example, Friedman [12] developed boosting methods 
for regression which are implemented as an optimization using the squared 
error loss function: this is what we call L2Boosting. It is essentially the same 
as Mallat and Zhang's [19] matching pursuit algorithm in signal processing. 



Received March 2004; revised March 2005. 

AMS 2000 subject classifications. Primary 62J05, 62J07; secondary 49M15, 62P10, 
68Q32. 

Key words and phrases. Binary classification, gene expression, Lasso, matching pursuit,
overcomplete dictionary, sparsity, variable selection, weak greedy algorithm.

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2006, Vol. 34, No. 2, 559-583. This reprint differs from the original in pagination
and typographic detail.






Recently, Efron, Hastie, Johnstone and Tibshirani [10] made a connec- 
tion for linear models between forward stagewise linear regression (FSLR), 
which seems closely related to L2Boosting, and the $\ell_1$-penalized Lasso [22]
or basis pursuit [5]. Roughly speaking: under some restrictive assumptions 
on the design matrix of a linear model, FSLR approximately yields the set 
of all Lasso solutions (when varying over the penalty parameter). This in- 
triguing insight may be useful to get a rough picture about L2Boosting via 
its relatedness to FSLR: it does variable selection and shrinkage, similar to 
the Lasso. However, it should be stated clearly that the methods are not the 
same; an example showing a distinct difference between L2Boosting and the 
Lasso is presented in Section 4.3. Moreover, we point out in Section 2.1 that 
FSLR and L2Boosting are different algorithms as well. 

As the main result, we prove here that L2Boosting for linear models yields
consistent estimates in the very high-dimensional context, where the number
of predictor variables is allowed to grow essentially as fast as O(exp(sample
size)), assuming that the true underlying regression function is sparse in
terms of the $\ell_1$-norm of the regression coefficients. This result is, to our
knowledge, the first about boosting in the presence of (fast) growing di- 
mension of the predictor. Some consistency results for boosting with fixed 
predictor dimension include [17, 18] as well as [25]. Except for Jiang's [17] 
result, these authors consider versions of boosting either with $\ell_1$-constraints
for the boosting aggregation coefficients or, as in [25], with a relaxed version 
of boosting which we found very difficult to use in practice due to the nonob- 
vious tuning of the relaxation, that is, how fast the boosting aggregation co- 
efficients should decay. The result by Zhang and Yu [25] may be generalized 
without too much effort to a setting with increasing dimension of the predic- 
tor variable, but their theoretical work includes only a rigorous treatment 
of the classification problem (besides the above mentioned disadvantage of 
their relaxed boosting algorithm). We believe that it is mainly for the case 
of high-dimensional predictors where boosting, among other methods, has 
a substantial advantage over more classical approaches. Some evidence for 
this will be given in Section 4.1, and other supporting empirical results have 
been reported in [3] in the different context of low- or high-dimensional 
additive models for comparing L2Boosting with more traditional methods
such as backfitting or MARS (restricted to additive function estimates). No- 
tably, many real datasets nowadays are of high-dimensional nature. Besides 
the well-documented good empirical performance of boosting, we identify 
it here as a method which can consistently recover very high-dimensional, 
sparse functions. 

We may also view our result as a consistency property for de-noising
using L2Boosting with a strongly overcomplete dictionary. In contrast to
a complete dictionary, for example, Fourier- or wavelet-basis, the strongly
overcomplete noisy case is not well understood. Our result yields at least
the basic property of consistency.

Besides the theoretical consistency result, we propose here a computa- 
tionally efficient approach for the tuning parameter in boosting, that is, 
the number of boosting iterations. We give an easily computable definition 
of degrees of freedom for L2Boosting, and we then propose its use in the 
corrected AIC criterion. Unlike cross-validation, our AIC-tuning does not
require boosting to be run multiple times. This makes the AIC-type
data-driven boosting computationally attractive: depending on the data, it is
sometimes as fast as the very efficient LARS algorithm for the Lasso with
tuning by its default ten-fold cross-validation [6, 10].

We demonstrate on some simulated examples how our L2Boosting per- 
forms for (low- and) mainly high-dimensional linear models, in compari- 
son to the Lasso, forward variable selection, ridge regression, ordinary least 
squares and a method which has been designed for high-dimensional
regression [14]. We also consider a difficult tumor-classification problem with gene
expression microarray data: the predictive accuracy of L2Boosting is
compared with four other, commonly used classifiers for microarray data, and
we briefly indicate the interpretation of the L2Boosting fit along the lines of
a linear model fit.

2. L2Boosting with componentwise linear least squares. To explain
boosting for linear models, consider a regression model

$$ Y_i = \sum_{j=1}^{p} \beta_j X_i^{(j)} + \varepsilon_i, \quad i = 1, \ldots, n, $$

with p predictor variables (the jth component of a p-dimensional vector x
is denoted by $x^{(j)}$) and a random, mean-zero error term $\varepsilon$. More precise
assumptions for the model are given in Section 3.

We first specify a base procedure: given some input data $\{(X_i, U_i); i = 1, \ldots, n\}$,
where $U_1, \ldots, U_n$ denote some (pseudo-)response variables which
are not necessarily the original $Y_1, \ldots, Y_n$, the base procedure yields an
estimated function

$$ \hat g(\cdot) = \hat g_{(X,U)}(\cdot), $$

based on $X = [X_i^{(j)}]_{i=1,\ldots,n;\, j=1,\ldots,p}$, $U = (U_1, \ldots, U_n)^T$. Here, we will
exclusively consider the componentwise linear least squares base procedure:

$$ \hat g_{(X,U)}(x) = \hat\beta_{\hat S}\, x^{(\hat S)}, \qquad \hat\beta_j = \frac{\sum_{i=1}^{n} U_i X_i^{(j)}}{\sum_{i=1}^{n} (X_i^{(j)})^2}, \qquad \hat S = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} (U_i - \hat\beta_j X_i^{(j)})^2. \tag{2.1} $$

Thus, the componentwise linear least squares base procedure performs a
linear least squares regression against the one selected predictor variable
which reduces the residual sum of squares most.
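The selection rule in (2.1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the paper: the function name `componentwise_ls` and the toy data are assumptions, and every column of X is assumed to be non-zero. It uses the identity that the residual sum of squares for predictor j equals $\|U\|^2 - \hat\beta_j^2 \sum_i (X_i^{(j)})^2$, so minimizing the residual sum of squares is the same as maximizing the second term.

```python
import numpy as np

def componentwise_ls(X, u):
    """Componentwise linear least squares, following (2.1).

    X: (n, p) predictor matrix, u: (n,) pseudo-response.
    Returns the selected index S and its least squares coefficient.
    Assumes every column of X is non-zero (hypothetical helper).
    """
    col_ss = np.sum(X ** 2, axis=0)      # sum_i (X_i^(j))^2 for each j
    beta = X.T @ u / col_ss              # beta_j for each j
    # RSS_j = ||u||^2 - beta_j^2 * col_ss_j, so pick the largest reduction
    reduction = beta ** 2 * col_ss
    s = int(np.argmax(reduction))
    return s, float(beta[s])

# Toy data (assumption): the response depends on column 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
u = 3.0 * X[:, 2] + 0.1 * rng.normal(size=20)
s, b = componentwise_ls(X, u)
print(s, round(b, 2))
```

Computing all p coefficients at once costs O(np) per call, which is what makes the base procedure cheap even when p is much larger than n.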

Boosting using the squared error loss, L2Boosting, has a simple structure. 
Boosting algorithms using other loss functions are described in [12]. 

L2BOOSTING ALGORITHM.

Step 1 (initialization). Given data $\{(X_i, Y_i); i = 1, \ldots, n\}$, apply the base
procedure yielding the function estimate

$$ \hat F^{(1)}(\cdot) = \hat g(\cdot), $$

where $\hat g = \hat g_{(X,Y)}$ is estimated from the original data. Set m = 1.

Step 2. Compute residuals $U_i = Y_i - \hat F^{(m)}(X_i)$, $i = 1, \ldots, n$, and fit the
real-valued base procedure to the current residuals. The fit is denoted by
$\hat g^{(m+1)}(\cdot) = \hat g_{(X,U)}(\cdot)$, which is an estimate based on the original predictor
variables and the current residuals.

Update

$$ \hat F^{(m+1)}(\cdot) = \hat F^{(m)}(\cdot) + \hat g^{(m+1)}(\cdot). $$

Step 3 (iteration). Increase the iteration index m by one and repeat step
2 until a stopping iteration M is achieved.

The estimate $\hat F^{(M)}(\cdot)$ is an estimator of the regression function $E[Y | X = \cdot]$. L2Boosting
is nothing other than repeated least squares fitting of residuals (cf. [3, 12]).
With m = 2 (one boosting step), it has already been proposed by Tukey [23]
under the name "twicing." In the nonstochastic context, the L2Boosting
algorithm is known as "Matching Pursuit" [19], which is popular in signal
processing for fitting overcomplete dictionaries.

It is often better to use small step sizes: we advocate here using the step
size $\nu$ in the update of $\hat F^{(m+1)}$ in step 2, which then becomes

$$ \hat F^{(1)}(\cdot) = \nu\, \hat g(\cdot), \qquad \hat F^{(m+1)}(\cdot) = \hat F^{(m)}(\cdot) + \nu\, \hat g^{(m+1)}(\cdot), \quad m \ge 1,\ 0 < \nu \le 1, \tag{2.2} $$

where $\nu$ is constant during boosting iterations and is small, for example,
$\nu = 0.1$. The parameter $\nu$ can be seen as a shrinkage parameter or,
alternatively, as describing the step size when updating $\hat F^{(m+1)}(\cdot)$ along the function
$\hat g^{(m+1)}(\cdot)$. Small step sizes (or shrinkage) make the boosting algorithm slower
and require a larger number M of iterations. However, the computational
slow-down often turns out to be advantageous for better out-of-sample
empirical prediction performance; see [3, 12].
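For a linear model, the algorithm with the shrunken update (2.2) reduces to repeatedly adding $\nu \hat\beta_{\hat S}$ to one coefficient and refitting the residuals. The following is a minimal sketch under stated assumptions: the name `l2boost`, the default arguments and the toy data are illustrative and do not come from the paper, and no intercept or centering is handled.

```python
import numpy as np

def l2boost(X, y, nu=0.1, n_iter=100):
    """L2Boosting with componentwise linear least squares and step size nu.

    Returns the coefficient vector after n_iter iterations and the path of
    residual sums of squares. Hypothetical helper; columns of X are assumed
    non-zero and no intercept is fitted.
    """
    n, p = X.shape
    col_ss = np.sum(X ** 2, axis=0)
    coef = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()
    rss_path = []
    for _ in range(n_iter):
        beta = X.T @ resid / col_ss               # base procedure on residuals
        s = int(np.argmax(beta ** 2 * col_ss))    # best single predictor
        coef[s] += nu * beta[s]                   # shrunken update, as in (2.2)
        resid -= nu * beta[s] * X[:, s]           # new residuals U_i
        rss_path.append(float(resid @ resid))
    return coef, rss_path

# Toy data (assumption): three effective predictors out of ten.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + X[:, 2] + 0.5 * rng.standard_normal(50)
coef, rss_path = l2boost(X, y, nu=0.1, n_iter=300)
print(np.round(coef[:4], 2))
```

Each iteration lowers the residual sum of squares by $(2\nu - \nu^2)\hat\beta_{\hat S}^2 \sum_i (X_i^{(\hat S)})^2 \ge 0$, so the path is non-increasing; how many iterations to run is the tuning question taken up in Section 2.2.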
2.1. Forward stagewise linear regression. L2Boosting with componentwise
linear least squares is related to forward stagewise linear regression
(FSLR), as pointed out by Efron, Hastie, Johnstone and Tibshirani [10].
FSLR differs from L2Boosting with componentwise linear least squares in
the update of the new estimate $\hat F^{(m+1)}$: instead of using (2.2), which becomes

$$ \hat F^{(m+1)}(x) = \hat F^{(m)}(x) + \nu\, \hat\beta_{\hat S_{m+1}}\, x^{(\hat S_{m+1})}, $$

where $\hat\beta_{\hat S_{m+1}}$ is the least squares estimate when fitting the current residuals
against the best predictor variable $x^{(\hat S_{m+1})}$, FSLR updates via

$$ \hat F^{(m+1)}(x) = \hat F^{(m)}(x) + \nu\, \mathrm{sign}(\hat\beta_{\hat S_{m+1}})\, x^{(\hat S_{m+1})}. $$

Note that this description of FSLR is equivalent to the one in [10]. In
our limited experience, FSLR has about the same prediction accuracy as
L2Boosting with componentwise linear least squares. However, we give here
two reasons to favor boosting over FSLR. First, the update in FSLR is not
scale-invariant whereas the boosting update is on the scale of the current
residuals via the magnitude of the least squares estimate $\hat\beta_{\hat S_{m+1}}$. It implies
that FSLR is often more sensitive to the choice of $\nu$ than boosting. In
particular, in case of an orthogonal linear model, L2Boosting has a uniform
approximation property for the soft-threshold estimator over all values of 
the threshold parameter, whereas this nice property does not hold anymore 
for FSLR [4]. Second, the number of boosting iterations can be reasonably 
well estimated via degrees of freedom defined as the trace of a boosting 
hat-matrix, as described in Section 2.2. Defining reasonable degrees of
freedom which are simple to compute does not seem easily possible for FSLR.
This has also been pointed out by Efron, Hastie, Johnstone and Tibshirani 
([10], comment after formula (4.11)), and they suggest the computationally 
intensive bootstrap to cope with this problem. 

We emphasize that Efron, Hastie, Johnstone and Tibshirani [10] do not 
advocate using FSLR in practice. They rather focus on the more interesting 
LARS algorithm. 

2.2. Stopping the boosting iterations. Boosting needs to be stopped at 
a suitable number of iterations, to avoid overfitting. The computationally 
efficient AICc criterion in (2.3) below can be used in our context where the 
base procedure has linear components. 

Our goal here is to assign degrees of freedom for boosting. Denote by

$$ H^{(j)} = X^{(j)} (X^{(j)})^T / \|X^{(j)}\|^2, \quad j = 1, \ldots, p, $$

the n × n hat-matrix for the linear least squares fitting operator using the
jth predictor variable $X^{(j)} = (X_1^{(j)}, \ldots, X_n^{(j)})^T$ only; $\|x\|^2 = x^T x$ denotes the
squared Euclidean norm for a vector $x \in \mathbb{R}^n$. It is then straightforward to show [3]
that the L2Boosting hat-matrix, when using the step size $0 < \nu \le 1$, equals

$$ B_m = I - (I - \nu H^{(\hat S_m)})(I - \nu H^{(\hat S_{m-1})}) \cdots (I - \nu H^{(\hat S_1)}), $$

where $\hat S_i \in \{1, \ldots, p\}$ denotes the component which is selected in the
componentwise least squares base procedure in the ith boosting iteration.

Using the trace of $B_m$ as degrees of freedom, we employ a corrected version
of AIC (cf. [16]) to define a stopping rule for boosting:

$$ AIC_c(m) = \log(\hat\sigma^2) + \frac{1 + \mathrm{trace}(B_m)/n}{1 - (\mathrm{trace}(B_m) + 2)/n}, \qquad \hat\sigma^2 = n^{-1} \sum_{i=1}^{n} (Y_i - (B_m Y)_i)^2, \quad Y = (Y_1, \ldots, Y_n)^T. \tag{2.3} $$

An estimate for the number of boosting iterations is then

$$ \hat M = \arg\min_{1 \le m \le m_{upp}} AIC_c(m), $$

where $m_{upp}$ is a large upper bound for the candidate number of boosting
iterations.
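The stopping rule can be evaluated in the same pass as the boosting iterations by tracking the residual operator $R_m = I - B_m$. The sketch below is an illustration under assumptions, not the author's implementation: the name `l2boost_aicc`, the early exit when the denominator of (2.3) is no longer positive, and the toy data are all hypothetical choices.

```python
import numpy as np

def l2boost_aicc(X, y, nu=0.1, m_upp=200):
    """Run L2Boosting and evaluate AICc(m) from (2.3) at every iteration.

    Tracks R_m = I - B_m, so trace(B_m) = n - trace(R_m) and the residual
    vector is R_m y. Hypothetical helper; no intercept is fitted.
    """
    n, p = X.shape
    col_ss = np.sum(X ** 2, axis=0)
    R = np.eye(n)
    aicc = []
    for _ in range(m_upp):
        resid = R @ y
        beta = X.T @ resid / col_ss
        s = int(np.argmax(beta ** 2 * col_ss))
        x = X[:, s]
        # R_m = (I - nu H^(s)) R_{m-1}, with H^(s) = x x^T / ||x||^2
        R = R - nu * np.outer(x, x @ R) / col_ss[s]
        df = n - np.trace(R)                      # trace(B_m)
        denom = 1.0 - (df + 2.0) / n
        if denom <= 0.0:                          # assumption: stop once (2.3) is undefined
            break
        sigma2 = np.sum((R @ y) ** 2) / n
        aicc.append(np.log(sigma2) + (1.0 + df / n) / denom)
    aicc = np.array(aicc)
    m_hat = int(np.argmin(aicc)) + 1              # iterations are counted from 1
    return m_hat, aicc

# Toy data (assumption): three effective predictors out of ten.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 10))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + X[:, 2] + 2.0 * rng.standard_normal(40)
m_hat, aicc = l2boost_aicc(X, y)
print(m_hat)
```

Updating R costs O(n^2) per iteration and needs a single boosting run, in contrast to cross-validation, which would repeat the whole run once per fold.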



3. Consistency of L2Boosting in high dimensions. We present here a 
consistency result for L2Boosting in linear models where the number of pre- 
dictors is allowed to grow very fast as the sample size n increases. Consider 
the model 

$$ Y_i = f_n(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad f_n(x) = \sum_{j=1}^{p_n} \beta_{j,n}\, x^{(j)}, \tag{3.1} $$

where $X_1, \ldots, X_n$ are i.i.d. with $E|X^{(j)}|^2 = 1$ for all $j = 1, \ldots, p_n$ and $\varepsilon_1, \ldots, \varepsilon_n$
are i.i.d., independent from $\{X_s; 1 \le s \le n\}$, with $E[\varepsilon] = 0$. The case with
heteroscedastic $\varepsilon_i$'s and potential dependence between $\varepsilon_i$ and $X_i$ is
discussed in Remark 1 below. The number of predictors $p_n$ is allowed to grow
with the sample size n. Therefore, the predictor $X_i = X_{i,n}$ and the
response $Y_i = Y_{i,n}$ also depend on n, but we usually ignore this in the notation. The
scaling of the predictor variables $E|X^{(j)}|^2 = 1$ is not necessary for running
L2Boosting, but it allows one to identify the magnitude of the coefficients
$\beta_{j,n}$ [see also assumption (A1) below].
We make the following assumptions: 

(A1) The dimension of the predictor in model (3.1) satisfies
$p_n = O(\exp(C n^{1-\xi}))$, $n \to \infty$, for some $0 < \xi < 1$, $0 < C < \infty$.

(A2) $\sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}| < \infty$.

(A3) $\sup_{1 \le j \le p_n,\, n \in \mathbb{N}} \|X^{(j)}\|_\infty < \infty$, where $\|X\|_\infty = \sup_{\omega \in \Omega} |X(\omega)|$ ($\Omega$ denotes
the underlying probability space).

(A4) $E|\varepsilon|^s < \infty$ for some $s > 4/\xi$ with $\xi$ from (A1).

Assumption (A1) allows for a very large predictor dimension relative to
the sample size n. Assumption (A2) is an $\ell_1$-norm sparseness condition (it
could be generalized to $\sum_{j=1}^{p_n} |\beta_{j,n}| \to \infty$ sufficiently slowly as $n \to \infty$, at the
expense of additional restrictions on $p_n$). Even if $p_n$ grows, all predictors may
be relevant but most of them contribute only with small magnitudes (small
$|\beta_{j,n}|$). Assumption (A2) holds for regressions where the number of effective
predictors is finite and fixed: that is, the number of $\beta_{j,n} \ne 0$ is independent
of n and finite. Assumption (A3) about the boundedness of the predictor
variables can be relaxed at the price of more restrictive growth of $p = p_n$:
it suffices that $\sup_{1 \le j \le p_n} E|X^{(j)}|^s < \infty$ for some $s > 4$ if $p_n = O(n^{\alpha})$, where
$\alpha = \alpha(s) > 0$ is a number, depending on the number of existing moments s,
which converges monotonically to $\infty$ as s increases; that is, any polynomial
growth of $p_n$ is allowed if the number of moments s is sufficiently large.

Theorem 1. Consider the model (3.1) satisfying (A1)-(A4). Then, the
boosting estimate $\hat F^{(m)}(\cdot) = \hat F_n^{(m)}(\cdot)$ with the componentwise linear base
procedure from (2.1) satisfies: for some sequence $(m_n)_{n \in \mathbb{N}}$ with $m_n \to \infty$ ($n \to \infty$)
sufficiently slowly,

$$ E_X |\hat F_n^{(m_n)}(X) - f_n(X)|^2 = o_P(1), \quad n \to \infty, $$

where X denotes a new predictor variable, independent of and with the same
distribution as the X-component of the data $(X_i, Y_i)$, $i = 1, \ldots, n$.

A proof is given in Section 6. Theorem 1 says that L2Boosting recovers 
the true sparse regression function even if the number of predictor variables 
is essentially exponentially increasing with sample size n. Notably, no as- 
sumptions are needed on the correlation structure of the predictor variables. 

For the Lasso, a consistency result for high-dimensional regression has 
been given by Greenshtein and Ritov [15]. We should keep in mind, though, 
that the Lasso is a different estimator than L2Boosting, as will be demon- 
strated empirically with an example in Section 4.3. 

Remark 1. Theorem 1 also holds for possibly heteroscedastic errors $\varepsilon_i$
which are potentially dependent on $X_i$, by assuming $(X_1, Y_1), \ldots, (X_n, Y_n)$
i.i.d. and suitable moment conditions for $Y_i$. For the case with bounded $Y_i$,
a proof follows as for Corollary 1 below.
3.1. Binary classification. The theory of L2Boosting for binary
classification with $Y_i \in \{0, 1\}$ can be essentially deduced from squared error
regression. Bühlmann and Yu [3] argue why L2Boosting is also a reasonable
procedure for binary classification. We can always write

$$ f_n(x) = E[Y | X = x] = \mathbb{P}[Y = 1 | X = x], \qquad \varepsilon_i = Y_i - f_n(X_i), \tag{3.2} $$

where the $\varepsilon_1, \ldots, \varepsilon_n$ are independent but heteroscedastic with $E[\varepsilon_i] = 0$ and
$\mathrm{Var}(\varepsilon_i) = f_n(X_i)(1 - f_n(X_i))$. When using L2Boosting, we get an estimate
$\hat F_n^{(m)}(\cdot)$ for the conditional probability function $\mathbb{P}[Y = 1 | X = x]$, and the L2Boosting
plug-in classifier (for equal misclassification costs) is given by

$$ \hat C_n^{(m)}(x) = 1_{[\hat F_n^{(m)}(x) > 1/2]}. $$

The proof of Theorem 1 essentially goes through and we get the following:

Corollary 1. Consider a binary classification problem with $(X_1, Y_1), \ldots, (X_n, Y_n)$
independent and $Y_i \in \{0, 1\}$ for all $i = 1, \ldots, n$. Assume that
$f_n(x) = \mathbb{P}_n[Y = 1 | X = x] = \sum_{j=1}^{p_n} \beta_{j,n}\, x^{(j)}$ and (A1)-(A3) hold. Then, for the
L2Boosting estimate as in Theorem 1: for some sequence $(m_n)_{n \in \mathbb{N}}$ with
$m_n \to \infty$ ($n \to \infty$) sufficiently slowly,

$$ E_X |\hat F_n^{(m_n)}(X) - f_n(X)|^2 = o_P(1), \quad n \to \infty, $$

$$ \mathbb{P}_{X,Y}[\hat C_n^{(m_n)}(X) \ne Y] - L_{n,\mathrm{Bayes}} = o_P(1), \quad n \to \infty, $$

where $L_{n,\mathrm{Bayes}}$ denotes the Bayes risk $E_X[\min\{f_n(X), 1 - f_n(X)\}]$ and X, Y
denote new predictor and response variables, independent of and with the
same distribution as the data $(X_i, Y_i)$, $i = 1, \ldots, n$.

The proof is given in Section 6. 
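The plug-in classifier and the Bayes risk in Corollary 1 are simple functions of fitted and true conditional probabilities. As a small hedged illustration, with the helper names `plugin_classifier` and `bayes_risk` being assumptions and the expectation $E_X$ replaced by an average over supplied values:

```python
import numpy as np

def plugin_classifier(F_hat_values):
    """Plug-in rule: predict class 1 where the fitted value exceeds 1/2."""
    return (np.asarray(F_hat_values, dtype=float) > 0.5).astype(int)

def bayes_risk(f_values):
    """Average of min{f(X), 1 - f(X)} over the supplied values of f(X).

    This approximates E_X[min{f(X), 1 - f(X)}] by a sample mean.
    """
    f = np.asarray(f_values, dtype=float)
    return float(np.mean(np.minimum(f, 1.0 - f)))

labels = plugin_classifier([0.2, 0.5, 0.7])
risk = bayes_risk([0.2, 0.9])
print(labels, risk)
```

Note that fitted values from L2Boosting are not constrained to [0, 1]; the plug-in rule only uses which side of 1/2 they fall on.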

4. Numerical results. 

4.1. Low-dimensional regression surface within low- or high-dimensional
predictor space. We consider the model

$$ X \sim \mathcal{N}_p(0, V), \quad Y = f(X) + \varepsilon, \quad p \in \{3, 10, 100\}, $$
$$ f(X) = a(V)(1 + 5X^{(1)} + 2X^{(2)} + X^{(3)}), \quad \varepsilon \sim \mathcal{N}(0, 2^2), \tag{4.1} $$

where a(V) is a scaling factor. The covariance matrix for the predictor
variable X and the factor a(V) are chosen as

$$ V = I_p, \quad a(V) = 1 \tag{4.2} $$
for uncorrelated predictors; or for block-correlated predictors,

(4.3)   V = (p × p matrix with entries 1, b, c and 0),   b = 0.677, c = 0.323, a(V) = 0.779.



The constant a(V) is such that the signal-to-noise ratio $E|f(X)|^2/\sigma^2$ is about
the same for both model specifications. The model (4.1) with either
specification (4.2) or (4.3) has only three effective predictors plus an intercept, all
of them contributing to the regression function with different magnitudes
(different coefficients). The sample size is denoted by n; that is, we generate
n i.i.d. realizations $(X_i, Y_i)$, $i = 1, \ldots, n$, from the model.
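For the uncorrelated specification (4.2), drawing one dataset from model (4.1) takes only a few lines. This is an illustrative sketch: the function name `simulate_model_41` and the seed are assumptions, and only the case V = I_p with a(V) = 1 is covered.

```python
import numpy as np

def simulate_model_41(n, p, rng):
    """Draw n observations from model (4.1) with specification (4.2).

    Returns the predictors X, the responses y and the noise-free values f.
    Hypothetical helper; requires p >= 3.
    """
    X = rng.standard_normal((n, p))                        # X ~ N_p(0, I_p)
    f = 1.0 + 5.0 * X[:, 0] + 2.0 * X[:, 1] + X[:, 2]      # a(V) = 1
    y = f + 2.0 * rng.standard_normal(n)                   # eps ~ N(0, 2^2)
    return X, y, f

rng = np.random.default_rng(3)
X, y, f = simulate_model_41(n=20, p=10, rng=rng)
print(X.shape, y.shape)
```

Returning the noise-free values f alongside y makes it straightforward to evaluate the squared error of a fitted function against the true regression surface on fresh test points.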

We use L2Boosting with shrinkage factor $\nu = 0.1$ [see (2.2)] and the
corrected AIC criterion for stopping the boosting iterations [see (2.3)]. We
compare it with the Lasso using ten-fold cross-validation for selecting the
penalty parameter (i.e., using the default setting from the lars package in
R with ten-fold cross-validation; CRAN [6]), with forward variable
selection for optimizing the classical AIC criterion, with ordinary least squares
(OLS) without variable selection and with ridge regression using the
"oracle" ridge-penalty parameter which minimizes the squared error loss over the
simulations; the last cannot be used in practice but serves as an optimistic
value for the performance of ridge regression. Table 1 reports in detail the
mean squared error $\mathrm{MSE} = E[(\hat f(X) - f(X))^2]$, where X is a new test
observation, independent from but with the same distribution as the training
data. Figure 1 summarizes one of the settings. All results are based on 50
model simulations.

For the high-dimensional (relative to n) settings with $p \in \{10, 100\}$,
L2Boosting and the Lasso are clearly best for this model with very few
effective predictors (see Table 1). Figure 1 displays the good performance of
the corrected AIC criterion in (2.3) for stopping the boosting iterations. A
detailed comparison of the "oracle"-stopping rule of L2Boosting, which stops
at the boosting iteration minimizing the mean squared error (see Table 2),
can be made to the results in Table 1. Obviously, the "oracle" rule can only
be applied for simulated data. We also include in Table 2 the performance of
the Lasso with the "oracle" penalty parameter minimizing the mean squared
error. L2Boosting and the Lasso perform similarly when using the
"oracle"-tuning parameters (see Table 2), while the differences are somewhat more
pronounced when comparing AICc-stopped L2Boosting with Lasso using
ten-fold CV tuning (see Table 1).
We also consider the case when p increases exponentially while n grows
only linearly. We focus on the model (4.1) with (4.2). The results are given in
Table 3. For both L2Boosting and the Lasso, the mean squared error exhibits
only a slow increase as n grows linearly and p grows exponentially; compare
also with the results from Table 1 with fixed n = 20. For this particular
example, the Lasso is better for large p = 300 than L2Boosting (but this
does not imply a general superiority).

4.2. High-dimensional regression surface with $\ell_1$-coefficients. We
consider here a regression model which fits into the theory of an adaptive
Table 1

MSE for L2Boosting, Lasso, forward variable selection
(fwd.var.sel.), ridge with "oracle" penalty (ridge*) and
OLS in model (4.1) with (4.2) and (4.3)

Method          (4.2), p = 3     (4.2), p = 10    (4.2), p = 100
L2Boost         1.658 (0.192)    2.318 (0.238)     8.792 (0.640)
Lasso           1.290 (0.162)    3.112 (0.463)     8.080 (0.773)
fwd.var.sel.    1.499 (0.215)    3.648 (0.421)    13.551 (1.275)
ridge*          1.079 (0.117)    4.436 (0.392)    25.748 (0.637)
OLS             1.103 (0.127)    5.674 (0.556)

Method          (4.3), p = 3     (4.3), p = 10    (4.3), p = 100
L2Boost         1.054 (0.104)    1.649 (0.181)     4.643 (0.239)
Lasso           1.163 (0.108)    3.007 (0.509)     3.453 (0.403)
fwd.var.sel.    1.206 (0.104)    2.893 (0.373)    12.685 (0.911)
ridge*          0.777 (0.079)    2.442 (0.226)    20.799 (0.538)
OLS             1.103 (0.127)    5.674 (0.556)

Sample size n = 20. Estimated standard errors in parentheses.



Table 2

MSE for L2Boosting (L2Boost*) and the Lasso (Lasso*), both with
"oracle"-tuning parameter

Model             L2Boost*         Lasso*
(4.2), p = 3      1.103 (0.127)    1.103 (0.127)
(4.3), p = 3      0.891 (0.100)    1.075 (0.117)
(4.2), p = 10     2.193 (0.230)    2.208 (0.262)
(4.3), p = 10     1.404 (0.114)    1.378 (0.116)
(4.2), p = 100    7.583 (0.593)    7.116 (0.603)
(4.3), p = 100    2.995 (0.208)    2.730 (0.234)

Model (4.1) with (4.2) and (4.3). Sample size n = 20. Estimated standard
errors in parentheses.
[Figure 1: two panels showing MSE against the number of boosting iterations;
legend entries: L2Boost, Lasso, fwd.var.sel., OLS, AIC-stopped.]

Fig. 1. MSE for L2Boosting as a function of boosting iterations (solid line), with
AICc-stopping (dashed line, AIC-stopped), Lasso (long-dashed line), forward variable
selection (dashed-dotted line) and OLS (dotted line) in model (4.1) with p = 10 and (4.2)
(top panel) and (4.3) (bottom panel). Sample size n = 20.



estimation procedure for high-dimensional linear regression, presented by
Goldenshluger and Tsybakov [14].
The model is

$$ \beta_j \sim \mathcal{N}(0, \sigma_j^2), \quad j = 1, \ldots, p, \qquad \varepsilon \sim \mathcal{N}(0, 1), \tag{4.4} $$

where $\varepsilon$, X and $\beta_1, \ldots, \beta_p$ are independent of each other. The values $\sigma_j^2$ are
decreasing as j increases. Thus, absolute values of the regression coefficients
$|\beta_j|$ have a tendency to become small for large j. A precise description of the
model is given in the Appendix. To summarize, the model is such that $p = p_n$
and $\beta_j = \beta_{j,n}$ ($j = 1, \ldots, p_n$) depend on n, satisfying with high probability
$\sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}| < \infty$, which is our assumption (A2) from Section 3. The
sample size is chosen as n = 100 and the resulting dimension of the predictor
then equals p = 23.

We use L2Boosting with shrinkage $\nu = 0.1$ [see (2.2)] and with the number of
boosting iterations estimated via the corrected AIC criterion as in (2.3),
and we compare it with the Lasso (using the default setting from the lars
package in R with ten-fold cross-validation; CRAN [6]), with forward variable
selection for optimizing the classical AIC criterion, with ridge regression
using ten-fold cross-validation for selecting the ridge parameter, with ordinary
least squares and with the procedure from [14]. Table 4 displays the results,
which are based on 50 independent model simulations. The method from
[14] produced one outlier with very large squared error, but the median of
the squared errors is still worse than for L2Boosting, Lasso and ridge, which
are performing best for this model.

Moreover, the method from [14] depends on the indexing of the predictor
variables and is tailored for regression problems where the coefficients $\beta_j$
have a tendency to decay as j increases (e.g., in time series where j indicates
the jth lagged variable). All other methods do not depend on indexing the
predictor variables. We also ran the method from [14] on the same model
but with index-reversed regression coefficients

$$ \tilde\beta_1, \ldots, \tilde\beta_{23} = \beta_{23}, \ldots, \beta_1, \quad \beta_j \text{ as in (4.4)}. \tag{4.5} $$

Table 3

MSE for L2Boosting with AICc-stopping (L2Boost), with
"oracle"-stopping (L2Boost*), for Lasso with ten-fold CV tuning
(Lasso), with "oracle"-tuning (Lasso*)

(n, p)      L2Boost          Lasso            L2Boost*         Lasso*
(20, 3)     1.658 (0.192)    1.290 (0.162)    1.103 (0.127)    1.103 (0.127)
(40, 30)    2.090 (0.199)    2.504 (0.274)    1.730 (0.169)    1.438 (0.120)
(60, 300)   3.652 (0.186)    2.136 (0.143)    2.372 (0.135)    1.855 (0.122)

Model (4.1) with (4.2). Estimated standard errors in parentheses.



Table 4

MSE for L2Boosting, Lasso, the method from Goldenshluger and Tsybakov [14] (G&T),
forward variable selection (fwd.var.sel.), ridge (ridge) and OLS in model (4.4)

L2Boost          Lasso            G&T              fwd.var.sel.     ridge            OLS
0.132 (0.006)    0.135 (0.009)    0.195 (0.047)    0.279 (0.019)    0.116 (0.008)    0.313 (0.017)

Sample size n = 100. Estimated standard errors in parentheses.
The mean squared error (MSE) for the G&T method with (4.5) was then
0.224 (0.025), which shows very clearly its sensitivity to the indexing of the
variables.

4.3. L2Boosting is different from Lasso. Consider a model with
predictors as in (4.1) and (4.3) with p = 100 but with regression function

$$ f(X) = 0.2 + 0.2 \sum_{j=1}^{100} X^{(j)} \tag{4.6} $$

and noise $\varepsilon \sim \mathcal{N}(0, 0.5^2)$. The sample size is chosen as n = 20. This model is
high-dimensional and nonsparse, and it has a high signal-to-noise ratio.

Since all the predictors contribute equally, we may want to keep many
of the variables in the model and shrink their corresponding coefficient
estimates toward zero. However, the Lasso will only allow one to select at most
min(n, p + 1) = 20 predictor variables (including an intercept); see [26].
When generating one realization of the model (4.6), L2Boosting with the
AICc-stopping rule selected 42 predictor variables (including the intercept),
whereas the corresponding number of selected variables with Lasso, tuned
by ten-fold cross-validation, is only 13. Thus, we have here an example which
demonstrates a feature of L2Boosting which is qualitatively different from
the Lasso.

A comparison in terms of the mean squared error yields

L2Boost: 9.468 (0.251); Lasso: 12.140 (0.346); Ridge: 5.548 (0.229).

The methods are described in Section 4.2 (estimated standard errors in
parentheses). It is no surprise that ridge regression (using ten-fold
cross-validation for tuning) performs clearly best. It keeps all variables in the
model and shrinks the corresponding estimates toward zero; this is tailored
for the structure of the model (4.6) where all the variables contribute equally.
We also see from the mean squared error that L2Boosting is quite different
from (in fact better than) the Lasso. It is not difficult to modify this example
such that ridge regression becomes worse than L2Boosting.

4.4. Gene expression microarray data. We consider a dataset which 
monitors p = 7129 gene expressions in 49 breast tumor samples using the 
Affymetrix technology; see [24]. After thresholding to a floor of 100 and a 
ceiling of 16,000 expression units, we applied a base 10 log-transformation 
and standardized each experiment to zero mean and unit variance. For 
each sample, a binary response variable is available, describing the
status of lymph node involvement in breast cancer. The data are available
at mgm.duke.edu/genome/dna_micro/work/.
Table 5

Cross-validated misclassification rates for lymph node breast cancer data

                      L2Boost    FPLR      1-NN      DLDA      SVM
Misclassifications    30.50%     35.25%    43.25%    36.12%    36.88%

L2Boosting with AIC-stopping (L2Boost), forward variable selection
penalized logistic regression (FPLR), 1-nearest-neighbor rule (1-NN),
diagonal linear discriminant analysis (DLDA) and a support vector
machine (SVM).



We use L2Boosting although the data have the structure of a binary
classification problem; Section 3.1 and Corollary 1 yield justification for
this, and, for example, Zou and Hastie [26] also use a penalized squared
error regression for binary classification with microarray gene expression
predictors. The only modification is the AIC-stopping criterion: instead of
(2.3), we use

$$ AIC(m) = -2 \cdot \text{log-likelihood} + 2 \cdot \mathrm{trace}(B_m), $$

with the Bernoulli log-likelihood. Instead of L2Boosting, we could also use
the LogitBoost algorithm [13]: for stopping, the penalty term in the AIC
criterion above then needs some modification since LogitBoost involves an
operator other than $B_m$.

We estimate the classification performance by a cross-validation scheme
where we randomly divide the 49 samples into balanced training and test
data of sizes 2n/3 and n/3, respectively, and we repeat this 50 times. We
compare L2Boosting with AIC-stopping (as described above) with four
other classification methods: 1-nearest neighbor, diagonal linear
discriminant analysis, support vector machine with radial basis kernel (from the
R-package e1071 and using its default values) and a forward selection
penalized logistic regression model (using a reasonable penalty parameter and
number of selected genes). For 1-nearest neighbor, diagonal linear
discriminant analysis and support vector machine, we pre-select the 200 genes
which have the best Wilcoxon score in a two-sample problem (estimated
from the training dataset only), which is recommended to improve the
classification performance; see [9]. Our L2Boosting and the forward
variable selection penalized regression are run without pre-selection of genes.
The results are given in Table 5. When transforming the response
variable to $Y \in \{-1/2, 1/2\}$, that is, subtracting the prior class probability 1/2,
L2Boosting has a cross-validated misclassification rate of 23.13% [4].
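
The resampling scheme can be sketched in Python as follows. This is an illustration only: "balanced" is interpreted here as stratification by class, which is an assumption, and the helper names `balanced_split` and `repeated_cv_error` as well as the `fit_predict` callback are hypothetical.

```python
import numpy as np

def balanced_split(y, train_frac=2/3, rng=None):
    """Random split that keeps roughly train_frac of each class for training."""
    rng = np.random.default_rng() if rng is None else rng
    train = []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        k = int(round(train_frac * len(idx)))
        train.extend(idx[:k])
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

def repeated_cv_error(fit_predict, X, y, n_rep=50, seed=0):
    """Average test misclassification rate over n_rep random balanced splits.

    fit_predict(X_train, y_train, X_test) must return predicted labels.
    """
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_rep):
        tr, te = balanced_split(y, rng=rng)
        pred = fit_predict(X[tr], y[tr], X[te])
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs))
```

Any gene pre-selection (such as the Wilcoxon ranking) would have to happen inside `fit_predict`, so that it only sees the training part of each split.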

For this difficult classification problem, our L2Boosting with
componentwise linear least squares (even without centering the response) performs
well. It is also interesting to note that the minimal cross-validated
misclassification rate as a function of boosting iterations is 29.25%. It shows that
the AIC-stopping rule is very accurate for this example. A method which
we found to perform better for this dataset is the recently proposed Pelora
algorithm [7], which does supervised gene grouping: its misclassification rate
is 27.88%.

We also show in Figure 2 the estimated regression coefficients for the 42
genes which have been selected during the boosting iterations until
AIC-stopping; the AIC curve is also shown in Figure 2. For comparing the
influence of different genes, we display scaled coefficients
$\hat{\beta}_j \sqrt{\widehat{\mathrm{Var}}(X^{(j)})}$, which
correspond to the estimated coefficients when standardizing the genes to
unit variance. There is one gene whose positive expression strongly points
toward the class with $Y = 0$ (having negative scaled regression coefficient)
and there are five genes whose positive expressions point toward the class
with $Y = 1$. The smallest scaled regression coefficient corresponds to a gene
which appears as the second best when ranking all the genes with the score
of a two-sample Wilcoxon test; the five largest scaled coefficients correspond
to the Wilcoxon-based ranks 7, 6, 1, 121, 3 among all the genes. But it
should be emphasized that, as usual, our estimated regression model takes
partial correlations between the class variable $Y$ and gene expressions (given
all other remaining genes) into account, which goes well beyond describing
the effects of single genes only.



[Figure 2, two panels. Left panel: sorted regression coefficients versus selected genes. Right panel: AIC statistic versus boosting iterations.]

Fig. 2. Lymph node breast cancer data. Left: scaled regression coefficients $\hat{\beta}_j \sqrt{\widehat{\mathrm{Var}}(X^{(j)})}$
(in increasing order) from L2Boosting for the selected 42 genes. Right: AIC statistic as a
function of L2Boosting iterations with minimum at 108.
5. Conclusions. We consider L2Boosting for fitting linear models. The
method does variable selection and shrinkage, a property which is very useful
in practical applications. This indicates that L2Boosting is related to the
$\ell_1$-penalized Lasso, but the methods are not the same.

As a useful device, we propose a simple estimate for the number of boost- 
ing iterations, which is the tuning parameter of the method, by using a 
corrected AICc criterion. This makes boosting computationally attractive, 
since we do not have to run it multiple times in a cross-validation set-up. 

We then present some theory for very high-dimensional regression (or
for de-noising with strongly overcomplete dictionaries), saying that if the
underlying true regression function is sparse in terms of the $\ell_1$-norm of
the regression coefficients, L2Boosting consistently estimates the true
regression function, even when the number of predictor variables grows like
$p_n = O(\exp(n^{1-\xi}))$ for some (small) $\xi > 0$. Notably, no assumptions are made
on the correlation structure of the predictors. Thus, we identify L2Boosting
as a method which is able, under mild assumptions, to consistently recover
very high-dimensional, sparse functions.

6. Proofs. We first consider the regression case where the step-size in
(2.2) equals $\nu = 1$. In Section 6.3, we give the argument for arbitrary, fixed
$0 < \nu \le 1$. Finally, we present the case for binary classification in Section
6.4.

6.1. A population version. The L2Boosting algorithm has a population 
version which is known as "matching pursuit" [19] or "weak greedy algo- 
rithm" [21]. 

Consider the Hilbert space $L_2(P) = \{f;\ \|f\|^2 = \int f(x)^2 \, dP(x) < \infty\}$ with
inner product $\langle f, g \rangle = \int f(x) g(x) \, dP(x)$. Here, the probability measure $P$ is
generating the predictor $X$ in model (3.1). To be precise, the probability
measure $P = P_n$ depends on $n$ since the dimensionality of $X$ is growing with
$n$: we are actually looking at a sequence of Hilbert spaces $L_2(P_n)$, but we
often ignore this notationally [a uniform bound in (6.5) will be a key result
to deal with such sequences of Hilbert spaces].

Denote the components of $X$ by

$$g_j(x) = x^{(j)}, \quad j = 1, \ldots, p_n.$$

Note that by assumption, $\|g_j\| = 1$ for all $j$. Define the following sequence
of remainder functions, called matching pursuit or weak greedy algorithm:

$$R^0 f = f, \qquad R^m f = R^{m-1} f - \langle R^{m-1} f, g_{S_m} \rangle g_{S_m}, \quad m = 1, 2, \ldots, \tag{6.1}$$

where $S_m$ would be ideally chosen as

$$S_m = \arg\max_{1 \le j \le p_n} |\langle R^{m-1} f, g_j \rangle|.$$

The choice function $S_m$ is sometimes infeasible to realize in practice. A
weaker criterion is: for every $m$ (under consideration), choose any $S_m$ which
satisfies

$$|\langle R^{m-1} f, g_{S_m} \rangle| \ge b \cdot \sup_{1 \le j \le p_n} |\langle R^{m-1} f, g_j \rangle| \quad \text{for some } 0 < b \le 1. \tag{6.2}$$

Of course, the sequence $R^m f = R^{m,S} f$ depends on $S_1, S_2, \ldots, S_m$, that is, on how we
actually make the choice in (6.2). Again, we will ignore this notationally.
It easily follows that

$$f = \sum_{j=0}^{m-1} \langle R^j f, g_{S_{j+1}} \rangle g_{S_{j+1}} + R^m f$$

and

$$\|R^m f\|^2 = \|R^{m-1} f\|^2 - |\langle R^{m-1} f, g_{S_m} \rangle|^2. \tag{6.3}$$
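
The recursion (6.1) with the ideal selector can be sketched in Python. As a stand-in for $L_2(P)$, the sketch assumes the finite-dimensional space $\mathbb{R}^d$ with the Euclidean inner product and a dictionary stored as unit-norm columns of a matrix `G`; the function name `matching_pursuit` and this setup are illustrative assumptions, not part of the paper.

```python
import numpy as np

def matching_pursuit(f, G, m_steps):
    """Matching pursuit residuals R^0 f, ..., R^m f in Euclidean space.

    f : (d,) array, the target function represented as a vector.
    G : (d, p) array whose columns g_j have unit Euclidean norm.
    """
    residuals = [f.copy()]
    r = f.copy()
    for _ in range(m_steps):
        inner = G.T @ r                  # <R^{m-1} f, g_j> for all j
        s = np.argmax(np.abs(inner))     # ideal selector S_m (b = 1 in (6.2))
        r = r - inner[s] * G[:, s]       # update as in (6.1)
        residuals.append(r.copy())
    return residuals
```

Because each step removes the projection onto a unit-norm element, the squared residual norm drops by exactly the squared inner product, which is the identity (6.3).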

6.1.1. Temlyakov's result. Temlyakov [21] gives a uniform bound for the
algorithm in (6.1) with (6.2).

If the function $f$ is representable as

$$f(x) = \sum_j \beta_j g_j(x), \qquad \sum_j |\beta_j| \le B < \infty, \tag{6.4}$$

which is true by our assumption (A2), then

$$\|R^m f\| \le B (1 + m b^2)^{-b/(2(2+b))}, \quad 0 < b \le 1 \text{ as in (6.2)}. \tag{6.5}$$

By construction, $R^m f$ depends on the selectors $S_1, \ldots, S_m$ in (6.2). The
mathematical power of the bound in (6.5) is that it holds for any selectors
$S_1, \ldots, S_m$ which satisfy (6.2). In particular, the bound also holds for
sequences $R^m f$ which depend on the sample size $n$ (since $X \sim P = P_n$ and
also the function of interest $f = f_n$ depend on $n$).

6.2. Asymptotic analysis as sample size increases. The L2Boosting
algorithm can be represented analogously to (6.1). We introduce the notation

$$\langle f, g \rangle_{(n)} = n^{-1} \sum_{i=1}^n f(X_i) g(X_i) \quad \text{and} \quad \|f\|_{(n)}^2 = n^{-1} \sum_{i=1}^n f(X_i)^2$$

for functions $f, g: \mathbb{R}^{p_n} \to \mathbb{R}$. Without loss of generality (but simplifying the
notation), we assume in the sequel that $\|g_j\|_{(n)} = 1$ for all $j$ and $n$ (note that
$\|g_j\| = 1$ holds already); the justification follows from Lemma 1(i) below. As
before, we denote by $Y = (Y_1, \ldots, Y_n)^T$ the vector of response variables.
Define

$$\hat{R}_n^0 f = f, \qquad \hat{R}_n^1 f = f - \langle Y, g_{\hat{S}_1} \rangle_{(n)} g_{\hat{S}_1},$$

$$\hat{R}_n^m f = \hat{R}_n^{m-1} f - \langle \hat{R}_n^{m-1} f, g_{\hat{S}_m} \rangle_{(n)} g_{\hat{S}_m}, \quad m = 2, 3, \ldots,$$

where

$$\hat{S}_1 = \arg\max_{1 \le j \le p_n} |\langle Y, g_j \rangle_{(n)}|,$$

$$\hat{S}_m = \arg\max_{1 \le j \le p_n} |\langle \hat{R}_n^{m-1} f, g_j \rangle_{(n)}|, \quad m = 2, 3, \ldots.$$

By definition, $\hat{R}_n^m f = f - \hat{F}^{(m)}$ is the difference of the function $f$ and its
L2Boosting estimate $\hat{F}^{(m)}$. Note that we emphasize here the dependence of
$\hat{R}_n^m$ on $n$ since finite-sample estimates $\langle \hat{R}_n^{m-1} f, g_j \rangle_{(n)}$ are involved.

6.2.1. A semipopulation version. For analyzing $\hat{R}_n^m f$, we want to use
Temlyakov's [21] result from (6.5). We will apply it to a semipopulation
version $\tilde{R}_n^m f$, as defined below [since it seems difficult to establish (6.2) for
$\hat{R}_n^m f$ directly].

Consider

$$\tilde{R}_n^0 f = f,$$

$$\tilde{R}_n^m f = \tilde{R}_n^{m-1} f - \langle \tilde{R}_n^{m-1} f, g_{\hat{S}_m} \rangle g_{\hat{S}_m}, \quad m = 1, 2, \ldots,$$

where $\hat{S}_m$ is the selector from the sample version above.

The strategy will be as follows. First, we want to establish a finite-sample
analogue of (6.2) for the estimated selectors $\hat{S}_m$; this will then allow us to
use Temlyakov's [21] result from (6.5) for $\tilde{R}_n^m f$. Finally, we need to analyze
the difference $\hat{R}_n^m f - \tilde{R}_n^m f$.

6.2.2. Uniform laws of large numbers. 

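Lemma 1. Under the assumptions (A1)-(A4), with $0 < \xi < 1$ as in
(A1):

(i) $\sup_{1 \le j,k \le p_n} |n^{-1} \sum_{i=1}^n g_j(X_i) g_k(X_i) - \mathbb{E}[g_j(X) g_k(X)]| = \zeta_{n,1} = O_P(n^{-\xi/2})$,

(ii) $\sup_{1 \le j \le p_n} |n^{-1} \sum_{i=1}^n g_j(X_i) \varepsilon_i| = \zeta_{n,2} = O_P(n^{-\xi/2})$,

(iii) $\sup_{1 \le j \le p_n} |n^{-1} \sum_{i=1}^n f(X_i) g_j(X_i) - \mathbb{E}[f(X) g_j(X)]| = \zeta_{n,3} = O_P(n^{-\xi/2})$,

(iv) $\sup_{1 \le j \le p_n} |n^{-1} \sum_{i=1}^n g_j(X_i) Y_i - \mathbb{E}[g_j(X) Y]| = \zeta_{n,4} = O_P(n^{-\xi/2})$.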

Proof. For assertion (i), denote $M = \sup_j \|g_j(X)\|_\infty$; see assumption
(A3). Then Bernstein's inequality yields for every $\gamma > 0$,

$$\mathbb{P}\Big[ n^{\xi/2} \sup_{1 \le j,k \le p_n} \Big| n^{-1} \sum_{i=1}^n g_j(X_i) g_k(X_i) - \mathbb{E}[g_j(X) g_k(X)] \Big| > \gamma \Big] \le p_n^2 \, 2 \exp\Big( - \frac{\gamma^2 n^{1-\xi}}{2(\sigma_g^2 + M^2 \gamma n^{-\xi/2})} \Big),$$

where $\sigma_g^2$ is an upper bound for $\mathrm{Var}(g_j(X) g_k(X))$ for all $j, k$ (e.g., $\sigma_g^2 = M^4$).
Since $p_n^2 = O(\exp(2 C n^{1-\xi}))$, the right-hand side of the inequality above
becomes arbitrarily small for $n$ sufficiently large and $\gamma > 0$ large.
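
To see why the right-hand side vanishes, one can compare the two exponents. The following is a sketch under the assumption that $p_n \le K \exp(C n^{1-\xi})$ for constants $K, C > 0$ (labels introduced here for illustration), with $\sigma_g^2$ and $M$ as in the proof of assertion (i).

```latex
\begin{align*}
p_n^2 \cdot 2\exp\!\left(-\frac{\gamma^2 n^{1-\xi}}{2(\sigma_g^2 + M^2 \gamma n^{-\xi/2})}\right)
  &\le 2K^2 \exp\!\left( n^{1-\xi}
     \left[\, 2C - \frac{\gamma^2}{2(\sigma_g^2 + M^2 \gamma n^{-\xi/2})} \right] \right) \\
  &\le 2K^2 \exp\!\left( n^{1-\xi}
     \left[\, 2C - \frac{\gamma^2}{2(\sigma_g^2 + M^2 \gamma)} \right] \right),
  \qquad n \ge 1 .
\end{align*}
% The second step uses n^{-\xi/2} <= 1. Since \gamma^2 / (\sigma_g^2 + M^2 \gamma)
% grows linearly in \gamma, the bracket is negative for \gamma large enough,
% and the bound then tends to 0 as n -> infinity because n^{1-\xi} -> infinity.
```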

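For proving assertion (ii), we have to deal with the unboundedness of the
$\varepsilon_i$'s in order to apply Bernstein's inequality. Define the truncated variables

$$\varepsilon_i^{tr} = \begin{cases} \varepsilon_i, & \text{if } |\varepsilon_i| \le M_n, \\ \mathrm{sign}(\varepsilon_i) M_n, & \text{if } |\varepsilon_i| > M_n. \end{cases}$$

Then for $\gamma > 0$,

$$\begin{aligned}
\mathbb{P}\Big[ n^{\xi/2} \sup_{1 \le j \le p_n} \Big| n^{-1} \sum_{i=1}^n g_j(X_i) \varepsilon_i \Big| > \gamma \Big]
&\le \mathbb{P}\Big[ n^{\xi/2} \sup_{1 \le j \le p_n} \Big| n^{-1} \sum_{i=1}^n \big( g_j(X_i) \varepsilon_i^{tr} - \mathbb{E}[g_j(X) \varepsilon^{tr}] \big) \Big| > \gamma/3 \Big] \\
&\quad + \mathbb{P}\Big[ n^{\xi/2} \sup_{1 \le j \le p_n} \Big| n^{-1} \sum_{i=1}^n g_j(X_i) (\varepsilon_i - \varepsilon_i^{tr}) \Big| > \gamma/3 \Big] \\
&\quad + \mathbb{P}\Big[ n^{\xi/2} \sup_{1 \le j \le p_n} \big| \mathbb{E}[g_j(X) (\varepsilon - \varepsilon^{tr})] \big| > \gamma/3 \Big] \\
&= I + II + III,
\end{aligned}$$

since $\mathbb{E}[g_j(X) \varepsilon] = \mathbb{E}[g_j(X)] \mathbb{E}[\varepsilon] = 0$, which we use for $III$. We can bound $I$
again by using Bernstein's inequality:

$$I \le p_n \, 2 \exp\Big( - \frac{(\gamma^2/9) n^{1-\xi}}{2(\sigma^2 + M M_n (\gamma/3) n^{-\xi/2})} \Big), \tag{6.6}$$

where $\sigma^2$ is an upper bound for $\mathrm{Var}(g_j(X) \varepsilon^{tr})$ (e.g., $\sup_j \|g_j(X)\|_\infty^2 \, \mathbb{E}|\varepsilon|^2$).
When using

$$M_n = n^{\xi/4},$$

we can make the right-hand side in (6.6) arbitrarily small since $p_n = O(\exp(C n^{1-\xi}))$; thus, for every $\delta > 0$,

$$I \le \delta \quad \text{for } n \text{ sufficiently large, } \gamma \text{ sufficiently large}. \tag{6.7}$$

A bound for $II$ can be obtained as follows:

$$II \le \mathbb{P}[\text{some } |\varepsilon_i| > M_n] \le n \mathbb{P}[|\varepsilon| > M_n] \le n M_n^{-s} \mathbb{E}|\varepsilon|^s = O(n^{1 - s\xi/4}) = o(1), \quad n \to \infty, \tag{6.8}$$

since $s > 4/\xi$ by assumption (A4).
For $III$ we use the bound

$$III \le \mathbb{I}\Big[ n^{\xi/2} \sup_{1 \le j \le p_n} \big| \mathbb{E}[g_j(X)(\varepsilon - \varepsilon^{tr})] \big| > \gamma/3 \Big]. \tag{6.9}$$

Note that by the independence of $\varepsilon$ (and $\varepsilon^{tr}$) from $g_j(X)$,

$$\mathbb{E}[g_j(X)(\varepsilon - \varepsilon^{tr})] = \mathbb{E}[g_j(X)] \, \mathbb{E}[\varepsilon - \varepsilon^{tr}].$$

Hence, an upper bound is

$$|\mathbb{E}[g_j(X)(\varepsilon - \varepsilon^{tr})]| \le M |\mathbb{E}[\varepsilon - \varepsilon^{tr}]|.$$

The latter can be bounded as

$$\begin{aligned}
|\mathbb{E}[\varepsilon - \varepsilon^{tr}]| &= \Big| \int_{|x| > M_n} (\mathrm{sign}(x) M_n - x) \, dP_\varepsilon(x) \Big| \\
&\le \int \mathbb{I}_{[|x| > M_n]} (M_n + |x|) \, dP_\varepsilon(x) \\
&= M_n \mathbb{P}[|\varepsilon| > M_n] + \int |x| \, \mathbb{I}_{[|x| > M_n]} \, dP_\varepsilon(x) \\
&= O(M_n^{1-s}) + O(M_n^{-s/2}) = o(M_n^{-2}) = o(n^{-\xi/2})
\end{aligned}$$

since $s > 4/\xi > 4$ ($0 < \xi < 1$). Hence, by using (6.9),

$$III = 0 \quad \text{for } n \text{ sufficiently large, } \gamma > 0 \text{ sufficiently large},$$

and together with (6.7) and (6.8), this proves assertion (ii).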
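Assertion (iii) follows from (i): using the representation $f = \sum_{r=1}^{p_n} \beta_{r,n} g_r$,

$$\begin{aligned}
\sup_{1 \le j \le p_n} \Big| n^{-1} \sum_{i=1}^n f(X_i) g_j(X_i) - \mathbb{E}[f(X) g_j(X)] \Big|
&\le \sum_{r=1}^{p_n} |\beta_{r,n}| \sup_{1 \le j,k \le p_n} \Big| n^{-1} \sum_{i=1}^n g_j(X_i) g_k(X_i) - \mathbb{E}[g_j(X) g_k(X)] \Big| \\
&\le \sum_{r=1}^{p_n} |\beta_{r,n}| \, \zeta_{n,1} = O_P(n^{-\xi/2}).
\end{aligned}$$

Assertion (iv) follows from (ii) and (iii). □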

6.2.3. Recursive analysis of L2Boosting. Denote

$$\zeta_n = \max\{\zeta_{n,1}, \zeta_{n,2}, \zeta_{n,3}, \zeta_{n,4}\} = O_P(n^{-\xi/2}),$$

which is a bound for all assertions (i)-(iv) in Lemma 1. Also, we denote by
$\omega$ a realization of all $n$ datapoints.
Lemma 2. Under the assumptions of Lemma 1, there exists a constant
$0 < C_* < \infty$, independent of $n$ and $m$, such that

$$\sup_{1 \le j \le p_n} |\langle \hat{R}_n^m f, g_j \rangle_{(n)} - \langle \tilde{R}_n^m f, g_j \rangle| \le (5/2)^m \zeta_n C_*$$

on the set $A_n = \{\omega; |\zeta_n(\omega)| < 1/2\}$.
Note that Lemma 1 implies that $\mathbb{P}[A_n] \to 1$, $n \to \infty$. The constant $C_*$
depends on $\sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}|$.

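Proof. We proceed recursively. For $m = 0$, the statement follows
directly from Lemma 1(iv). Denote $A_n(m,j) = \langle \hat{R}_n^m f, g_j \rangle_{(n)} - \langle \tilde{R}_n^m f, g_j \rangle$. Then,
by definition,

$$\begin{aligned}
A_n(m,j) &= A_n(m-1,j) - \langle \tilde{R}_n^{m-1} f, g_{\hat{S}_m} \rangle \big( \langle g_{\hat{S}_m}, g_j \rangle_{(n)} - \langle g_{\hat{S}_m}, g_j \rangle \big) \\
&\quad - \langle g_{\hat{S}_m}, g_j \rangle_{(n)} \big( \langle \hat{R}_n^{m-1} f, g_{\hat{S}_m} \rangle_{(n)} - \langle \tilde{R}_n^{m-1} f, g_{\hat{S}_m} \rangle \big) \\
&= A_n(m-1,j) - I_{n,m}(j) - II_{n,m}(j).
\end{aligned} \tag{6.10}$$

From Lemma 1(i) we get

$$\sup_{1 \le j \le p_n} |I_{n,m}(j)| \le \|\tilde{R}_n^{m-1} f\| \, \|g_{\hat{S}_m}\| \, \zeta_n \le \|f\| \zeta_n, \tag{6.11}$$

where we have used the norm-reducing property in (6.3) for $\tilde{R}_n^m f$.
For the second term we proceed recursively:

$$\begin{aligned}
\sup_{1 \le j \le p_n} |II_{n,m}(j)| &\le \sup_{1 \le j \le p_n} |\langle g_{\hat{S}_m}, g_j \rangle_{(n)}| \sup_{1 \le j \le p_n} |A_n(m-1,j)| \\
&\le (1 + \zeta_n) \sup_{1 \le j \le p_n} |A_n(m-1,j)|.
\end{aligned} \tag{6.12}$$

For the last inequality, we have used again Lemma 1(i) and the
Cauchy-Schwarz inequality $|\langle g_{\hat{S}_m}, g_j \rangle| \le \|g_{\hat{S}_m}\| \|g_j\| = 1$.

Using the notation $B_n(m) = \sup_{1 \le j \le p_n} |A_n(m,j)|$, we get the following
recursion from (6.10)-(6.12):

$$B_n(0) \le \zeta_n,$$

$$B_n(m) \le B_n(m-1) + \zeta_n \|f\| + (1 + \zeta_n) B_n(m-1) \le (5/2) B_n(m-1) + \zeta_n \|f\| \quad \text{on the set } A_n.$$

Therefore,

$$\begin{aligned}
B_n(m) &\le (5/2)^m \zeta_n + \zeta_n \|f\| \sum_{j=0}^{m-1} (5/2)^j \le (5/2)^m \zeta_n \Big( 1 + \|f\| \sum_{j=0}^{m-1} (5/2)^{j-m} \Big) \\
&\le (5/2)^m \zeta_n \Big( 1 + \sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}| \sum_{k=1}^{\infty} (5/2)^{-k} \Big),
\end{aligned}$$

which completes the proof by setting $C_* = 1 + \sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}| \sum_{k=1}^{\infty} (5/2)^{-k}$.
□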
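
The constant in Lemma 2 can be made explicit by summing the geometric series. As a short worked step, writing $B = \sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}|$ (the bound from (6.4), used here as shorthand):

```latex
\begin{align*}
\sum_{k=1}^{\infty} (5/2)^{-k}
  = \sum_{k=1}^{\infty} \left(\tfrac{2}{5}\right)^{k}
  = \frac{2/5}{1 - 2/5}
  = \frac{2}{3},
\qquad\text{so}\qquad
C_* = 1 + \tfrac{2}{3}\, B .
\end{align*}
```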
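Analyzing $\tilde{R}_n^m f$. We are now ready to establish a finite-sample analogue
of (6.2) for $\tilde{R}_n^m f$. We have

$$\hat{S}_{m+1} = \arg\max_{1 \le j \le p_n} |\langle \hat{R}_n^m f, g_j \rangle_{(n)}|.$$

Hence, by invoking Lemma 2 (and denoting by $A_n$ the set as there) we get

$$\begin{aligned}
|\langle \hat{R}_n^m f, g_{\hat{S}_{m+1}} \rangle_{(n)}| &= \sup_{1 \le j \le p_n} |\langle \hat{R}_n^m f, g_j \rangle_{(n)}| \\
&\ge \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| - (5/2)^m \zeta_n C_* \quad \text{on the set } A_n.
\end{aligned}$$

Therefore, again by Lemma 2 for the first inequality to follow,

$$\begin{aligned}
|\langle \tilde{R}_n^m f, g_{\hat{S}_{m+1}} \rangle| &\ge |\langle \hat{R}_n^m f, g_{\hat{S}_{m+1}} \rangle_{(n)}| - (5/2)^m \zeta_n C_* \quad \text{on the set } A_n \\
&\ge \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| - 2 (5/2)^m \zeta_n C_* \quad \text{on the set } A_n.
\end{aligned} \tag{6.13}$$

Consider the set $B_n = \{\omega; \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| > 4 (5/2)^m \zeta_n C_*\}$. Then,
by (6.13),

$$|\langle \tilde{R}_n^m f, g_{\hat{S}_{m+1}} \rangle| \ge 0.5 \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| \quad \text{on the set } A_n \cap B_n. \tag{6.14}$$

Formula (6.14) says that the selectors $\hat{S}_m$ satisfy the condition (6.2) for $\tilde{R}_n^m f$
on the set $A_n \cap B_n$. We can now invoke Temlyakov's result in (6.5), since
the condition (6.2) holds on the set $A_n \cap B_n$ [as established in (6.14)]. We
have

$$\|\tilde{R}_n^m f\| \le B (1 + m/4)^{-1/10} = o(1) \quad \text{on the set } A_n \cap B_n \tag{6.15}$$

by choosing $m = m_n \to \infty$ ($n \to \infty$) (slow enough), where $B = \sup_{n \in \mathbb{N}} \sum_{j=1}^{p_n} |\beta_{j,n}| < \infty$; see (6.4) and assumption (A2).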
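
The exponent in (6.15) is the one from (6.5) evaluated at the weakness constant $b = 1/2$ that appears in (6.14). As a worked check:

```latex
\begin{align*}
b = \tfrac12:\qquad
m b^2 = \frac{m}{4},
\qquad
\frac{b}{2(2+b)} = \frac{1/2}{2 \cdot (5/2)} = \frac{1}{10},
\qquad\text{hence}\qquad
B\,(1 + m b^2)^{-b/(2(2+b))} = B\,(1 + m/4)^{-1/10}.
\end{align*}
```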

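For $\omega \in B_n^C = \{\omega; \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| \le 4 (5/2)^m \zeta_n C_*\}$, by using
formula (5.2) from [21] with $b_m$ as defined there (i.e., $b_m = b_{m-1} + |\langle \tilde{R}_n^{m-1} f, g_{\hat{S}_m} \rangle|$,
$b_0 = 1$),

$$\begin{aligned}
\|\tilde{R}_n^m f\|^2 &\le \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| \, b_m \\
&\le \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| (1 + m \|f\|) \\
&\le 4 (5/2)^m \zeta_n C_* (1 + m \|f\|) \quad \text{on the set } B_n^C.
\end{aligned} \tag{6.16}$$

For bounding the number $b_m$, we have used the norm-reducing property in
(6.3) applied to $\tilde{R}_n^m f$. Therefore, using (6.15), (6.16) and $\zeta_n = O_P(n^{-\xi/2})$
from Lemma 1, we have for $m = m_n \to \infty$ ($n \to \infty$) slow enough [e.g., $m_n = o(\log(n))$],

$$\begin{aligned}
\|\tilde{R}_n^{m_n} f\| &\le B (1 + m_n/4)^{-1/10} + 4 (5/2)^{m_n} \zeta_n C_* (1 + m_n \|f\|) \quad \text{on the set } (A_n \cap B_n) \cup B_n^C \\
&= o_P(1),
\end{aligned} \tag{6.17}$$

since $\mathbb{P}[(A_n \cap B_n) \cup B_n^C] \ge \mathbb{P}[A_n] \to 1$, $n \to \infty$, due to Lemma 1.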
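Analyzing $\hat{R}_n^m f$. By definition and using the triangle inequality,

$$\|\hat{F}^{(m)} - f\| = \|\hat{R}_n^m f\| \le \|\tilde{R}_n^m f\| + \|\hat{R}_n^m f - \tilde{R}_n^m f\|. \tag{6.18}$$

A recursive analysis can be developed for the second term on the right-hand
side:

$$A_n(m) = \|\hat{R}_n^m f - \tilde{R}_n^m f\|.$$

By definition,

$$\begin{aligned}
A_n(m) &\le A_n(m-1) + |\langle \hat{R}_n^{m-1} f, g_{\hat{S}_m} \rangle_{(n)} - \langle \tilde{R}_n^{m-1} f, g_{\hat{S}_m} \rangle| \, \|g_{\hat{S}_m}\| \\
&\le A_n(m-1) + (5/2)^{m-1} \zeta_n C_* \quad \text{on the set } A_n,
\end{aligned}$$

where the last inequality follows from Lemma 2. Therefore, for some constant
$C > 0$,

$$\|\hat{R}_n^m f - \tilde{R}_n^m f\| \le 3^m \zeta_n C = o_P(1) \tag{6.19}$$

by choosing $m = m_n \to \infty$ sufficiently slowly such that $3^{m_n} \zeta_n = o_P(1)$.
By (6.17)-(6.19) we get [e.g., by using the choice $m_n \to \infty$, $m_n = o(\log(n))$]

$$\mathbb{E}_X |\hat{F}^{(m_n)}(X) - f(X)|^2 = \|\hat{R}_n^{m_n} f\|^2 = o_P(1),$$

which completes the proof of Theorem 1.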
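
The role of the choice $m_n = o(\log(n))$ can be spelled out in a short worked step. This is a sketch; the constant $c$ is a label introduced here, while $\zeta_n = O_P(n^{-\xi/2})$ is the rate from Lemma 1 and $C_*$ is the constant from Lemma 2.

```latex
% Step 1: summing the recursion for A_n(m), with A_n(0) = 0 because both
% remainder sequences start at f:
\begin{align*}
A_n(m) \le \zeta_n C_* \sum_{k=0}^{m-1} (5/2)^k
       = \zeta_n C_* \,\frac{(5/2)^m - 1}{3/2}
       \le \tfrac{2}{3}\, C_*\, (5/2)^m \zeta_n
       \le 3^m \zeta_n C, \qquad C = \tfrac{2}{3} C_* .
\end{align*}
% Step 2: if m_n = o(log n), then for every c > 0 we have m_n <= c log n
% for n large, and therefore
\begin{align*}
3^{m_n} \zeta_n \le n^{c \log 3}\, \zeta_n = O_P\!\left(n^{\,c \log 3 - \xi/2}\right) = o_P(1)
\qquad \text{whenever } c < \frac{\xi}{2 \log 3} .
\end{align*}
% The same argument, with the extra polynomial factor (1 + m_n ||f||),
% handles the term (5/2)^{m_n} \zeta_n (1 + m_n ||f||) in (6.17).
```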

6.3. Arbitrary step-size $\nu$. For arbitrary, fixed step-size $0 < \nu \le 1$ in
(2.2), we need to make a few modifications to the proof.
Temlyakov's result in (6.5) becomes

$$\|R^m f\| \le B (1 + \nu(2 - \nu) m b^2)^{-b/(2(2+b))}, \quad 0 < b \le 1 \text{ as in (6.2)}.$$

Proof. The claim follows as in [21]. Using his notation, we use $a_m = \|R^m f\|^2$,
$y_m = |\langle R^{m-1} f, g_{S_m} \rangle|$, $b_m = b_{m-1} + \nu y_m$, $b_0 = 1$ and $t_m = b$ from
(6.2). □

We can then use exactly the same reasoning as in Section 6.2. At some
obvious places, a factor $\nu$ occurs in addition, and it can be trivially bounded
by 1. The only slightly nontrivial reasoning occurs in (6.16); but using $b_m$
as defined above (applied now to $\tilde{R}_n^m f$ instead of $R^m f$) yields the bound

$$\|\tilde{R}_n^m f\|^2 \le \sup_{1 \le j \le p_n} |\langle \tilde{R}_n^m f, g_j \rangle| \, b_m,$$

which then allows us to proceed as in Section 6.2.
6.4. Binary classification. The first assertion of Corollary 1 follows
exactly as in the proof of Theorem 1 by using the representation in (3.2). There
is no crucial place where we make use of homoscedastic errors $\varepsilon_i$: the uniform
laws of large numbers from Lemma 1 look formally a bit different (e.g., for
(ii) we need to subtract a term $\mathbb{E}[g_j(X_i) \varepsilon_i] = \mathbb{E}[g_j(X_i)(Y_i - f(X_i))]$), but the
i.i.d. structure of the pairs $(X_i, Y_i)$ suffices to get through. The moment
assumption for $\varepsilon_i = Y_i - f(X_i)$ trivially holds since $|Y_i| \le 1$ and $\sup_x |f(x)| \le 1$.

For the second assertion, it is well known that 2 times the $L_1$-norm bounds
from above the difference between the generalization error of a plug-in
classifier (expected 0-1 loss error for classifying a new observation) and the Bayes
risk ([8], Theorem 2.3). Furthermore, the $L_1$-norm is upper bounded by the
$L_2$-norm.
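
The two facts used for the second assertion can be chained as follows. The notation is introduced here for illustration and is not the paper's: $\eta(x) = \mathbb{P}[Y = 1 \mid X = x]$, $\hat{\eta}$ an estimate of it, $\hat{g}(x) = \mathbb{I}[\hat{\eta}(x) > 1/2]$ the plug-in classifier, $L(\hat{g})$ its misclassification probability for a new observation, and $L^*$ the Bayes risk.

```latex
\begin{align*}
L(\hat{g}) - L^*
  \;\le\; 2\, \mathbb{E}_X \left| \hat{\eta}(X) - \eta(X) \right|
  \;\le\; 2 \left( \mathbb{E}_X \left| \hat{\eta}(X) - \eta(X) \right|^2 \right)^{1/2}.
\end{align*}
% First inequality: the plug-in bound in terms of the L_1-norm.
% Second inequality: Cauchy-Schwarz (or Jensen), i.e. the L_1-norm is bounded by the L_2-norm.
% Consistency of the estimate in the L_2-norm therefore carries over to the
% excess misclassification risk.
```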

APPENDIX 

The model (4.4). The model (4.4) is as follows. Define $a_j = j^{0.51}$. Let the
parameter $\kappa$ be the solution of the equation $\sigma^2 n^{-1} \sum_{j \ge 1} a_j \lambda_j = \kappa$, where we
denote $\lambda_j = (1 - \kappa a_j)_+$. For $n = 100$, the solution is $\kappa = 0.199$. Determine
the predictor dimension $p = \max_j \{a_j < \kappa^{-1}\} = 23$. The variances are
$\sigma_j^2 = \lambda_j (n \kappa a_j)^{-1}$, $j = 1, \ldots, 23$, $n = 100$. It can be shown that such regression
coefficients belong with high probability to $\{(\beta_{j,n})_j;\ \sum_{j \ge 1} a_j^2 \beta_{j,n}^2 \le 1\}$ (note
that $p = p_n$ depends on $n$ via the parameter $\kappa = \kappa_n$).
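
The construction can be checked numerically with a small Python sketch. It assumes $\sigma^2 = 1$ (an assumption made here, not stated in this appendix), reads the defining equation as $\sigma^2 n^{-1} \sum_j a_j (1 - \kappa a_j)_+ = \kappa$, and solves it by bisection; the function name `solve_kappa` and the truncation level `j_max` are hypothetical.

```python
import numpy as np

def solve_kappa(n=100, sigma2=1.0, expo=0.51, j_max=10_000, tol=1e-12):
    """Solve sigma2/n * sum_j a_j (1 - kappa a_j)_+ = kappa by bisection.

    Returns (kappa, p, variances), where p = max{j : a_j < 1/kappa} and
    variances[j-1] = lambda_j / (n * kappa * a_j) for j = 1..p.
    sigma2 = 1 is an assumption of this sketch.
    """
    a = np.arange(1, j_max + 1, dtype=float) ** expo   # a_j = j^0.51

    def gap(kappa):
        lam = np.clip(1.0 - kappa * a, 0.0, None)      # lambda_j = (1 - kappa a_j)_+
        return sigma2 / n * np.sum(a * lam) - kappa

    lo, hi = 1e-6, 1.0                                 # gap(lo) > 0 > gap(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    kappa = 0.5 * (lo + hi)
    p = int(np.sum(a < 1.0 / kappa))                   # a_j is increasing in j
    lam = np.clip(1.0 - kappa * a[:p], 0.0, None)
    variances = lam / (n * kappa * a[:p])              # sigma_j^2
    return kappa, p, variances
```

Under these assumptions, `solve_kappa()` should agree, up to rounding, with the values $\kappa = 0.199$ and $p = 23$ quoted for $n = 100$.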